DOCUMENT RESUME 

ED 453 266                                          TM 032 803 

AUTHOR        Sykes, Robert C.; Truskosky, Denise; White, Hillory 
TITLE         Determining the Representation of Constructed Response Items 
              in Mixed-Item Format Exams. 
PUB DATE      2001-04-00 
NOTE          42p.; Paper presented at the Annual Meeting of the National 
              Council on Measurement in Education (Seattle, WA, April 
              11-13, 2001). 
PUB TYPE      Reports - Research (143) -- Speeches/Meeting Papers (150) 
EDRS PRICE    MF01/PC02 Plus Postage. 
DESCRIPTORS   *Constructed Response; Elementary Education; *Elementary 
              School Students; Error of Measurement; Item Response Theory; 
              Mathematics Tests; *Reliability; Scores; Test Construction; 
              Test Format; *Test Items; Writing Tests 
IDENTIFIERS   Unidimensionality (Tests); *Weighting (Statistical) 



ABSTRACT 



The purpose of this research was to study the effect of the 
three different ways of increasing the number of points contributed by 
constructed response (CR) items on the reliability of test scores from 
mixed-item-format tests. The assumption of unidimensionality that underlies 
the accuracy of item response theory model-based standard error predictions 
of reliability was initially evaluated for these tests. Large samples of 
students who had taken mixed-format field tests in mathematics at grades 5 
and 8 and writing at grades 3 and 8 were available from a state 
criterion-referenced testing program. The selection of subsets of items from 
test-blueprint-representative forms of similar content and difficulty 
permitted an evaluation of the effects of weighting CR items on total test 
scores relative to criterion scores of putatively greater generalizability. 
As expected, there was a cost in terms of precision of having fewer, though 
weighted, CR items across a wide range of ability. The increment in standard 
error attributed to weighting was predictably less in the middle of the scale 
where the forms were targeted. The magnitude of the increase in error and the 
particular portion of the scale where it occurs are determined by the 
locations and amount of information contributed by the deleted CR items 
relative to those that are retained. Implications of different approaches to 
weighting are discussed. (Contains 5 tables, 10 figures, and 10 references.) 

INTRODUCTION 



Constructed response (c.r.) items are now frequently found 
complementing multiple choice (m.c.) items in mixed-format 
examinations. These items are believed important in their 
capability to influence curriculum through their assessment of 
skills not evaluated by m.c. items, such as organized or creative 
expression, while the m.c. items allow a breadth of content 
coverage by an evaluation of content or factual knowledge. The 
employment of IRT models allows both types of items to be scaled 
together, providing the advantages of a single score if the 
assumptions of the model such as unidimensionality are met. Traub 
(1993), in a review of the studies that existed at that time, 
suggested that the items of the two formats probably do not 
measure different characteristics for tests in the Quantitative 
or Reading Comprehension domains but may measure different 
characteristics for Writing. 

The use of both the c.r. and m.c. item formats requires a 
determination of the degree to which they will be represented or 
weighted. One manner of defining the contribution the c.r. items 
will make to the total test score, as well as that of the m.c. 
items, is through the items' psychometric characteristics. 
Specifically, the use of IRT (pattern) scoring implies that a 
decision has been made to weight each item by its reliability 
(i.e. discrimination). This type of psychometrically imposed 
weighting, resulting in total test scores that are optimal in 
terms of reliability, may be contrasted to the test-designer 
imposed weighting of item formats that is the subject of this 
research. Because a set of c.r. items is not likely to produce a 
total score with reliability as great as a set of m.c. items 
administered in the same period of time (Wainer & Thissen, 1993), 
a rationale for test-designer imposed weighting would presumably 
be that the c.r. items are desired to increase the validity of the 
examination. 

Three different types of test-designer imposed weighting 
utilizing number-correct scoring with the employed IRT model are 
possible. (The assignment of the worth or point value of each 
type of item is another method of weighting items that is not 
considered here.) The first of these methods of weighting is 
through the specification of the test blueprint (i.e. blueprint 
representation). The representation of c.r. items in a test 
(i.e. relative proportion of total score points contributed by 
the c.r. items) is determined through this method by the 
stipulation of the number of c.r. items required in those 
categories assessing skills that can only be evaluated by these 
items and the number of c.r. items from categories that can be 
evaluated using either c.r. or m.c. items. 

The number of c.r. items in these latter categories can vary 
depending upon the availability or desirability of c.r. items. 
Relatively large numbers of c.r. items may be necessary for a 
test if there are many categories of the former type and/or c.r. 
items are preferred to fill the latter type of blueprint 
categories. 
Because c.r. items generally require longer response times, 
however, it may not be feasible to administer as many as are 
desired within the time available for testing. Testing time is 
especially a problem when the c.r. items require an extended 
response (e.r.), such as the writing samples given in response to 
a prompt. It may not be possible to administer more than one of 
these e.r. items, along with the accompanying m.c. and other c.r. 
items. 

Although administering a larger number of e.r. or c.r. items 
would be desirable from the standpoint of the generalizability of 
test scores, it is possible to increase the number of points 
coming from a set of c.r. items without increasing their number 
(and testing time). A second possible type of weighting is 
implemented by multiplying the portion of the test characteristic 
curve (tcc) that is contributed by these items by an integer 
factor (i.e. tcc component weighting). Thus, if it were desired to 
increase the number of points contributed to the total test score 
by a single e.r. response from six to 12 points, the expected e.r. 
score would be multiplied by two. The increased expected item 
score is then added to those for the other items to obtain the 
expected total raw score at each scale score across the scale and 
thus the scoring tables. 

Ito and Sykes (2000) examined the effect of weighting sets 
of c.r. items through the test characteristic curve relative to a 
criterion of no weighting for three Writing tests. The authors 
documented relatively small decreases in the precision of test 
scores when a limited number of c.r. items were weighted. 

A third way of increasing the representation of c.r. items 
is the summing, rather than averaging (and if necessary rounding 
to the nearest integer), of the ratings of two readers (i.e. 
summed readings or ratings). In addition to the point value of 
the item being doubled, the number of score levels for each c.r. 
item is increased from n (the number of levels of the rubric 
including 0) to 2n-1. Summed ratings is more restricted than tcc 
component weighting in that it requires multiple readers for each 
c.r. response and hence is limited to increasing the points from 
the c.r. items by a factor of two without prohibitively 
increasing the number of raters (and readings). 

The method of summed ratings is imposed through the item 
parameter estimates and thus the latent scale. In contrast, tcc 
component weighting is implemented through the score obtained 
after the set of c.r. items, with their rubric-determined point 
values and number of levels, has been scaled with the m.c. items. 
Because the number of levels of the c.r. items is increased with 
summed ratings, item reliability may change, potentially affecting 
form reliability and IRT test score information. 

The purpose of this research was to investigate the effect 
of the three different ways of increasing the number of points 
contributed by the c.r. items on the reliability of test scores 
from mixed-item-format tests. The assumption of 
unidimensionality that underlies the accuracy of IRT model-based 
standard error predictions of reliability was initially evaluated 
for these tests. 



METHOD 



Source Data 

Large samples of students who had taken mixed-format field 
tests for Math at Grades 5 and 8 and Writing at Grades 3 and 8 
were available from a state criterion-referenced testing program. 
Responses to the subset of items in each of the field test forms 
that were later chosen to constitute a complete operational form 
were selected. Consequently the selected items for each 
grade/content area (hereafter forms) represent the operational 
test blueprints. 

Responses to a second prompt were included with each of the 
two Writing forms. Although an item score for an extended 
response to a prompt is computed as an average over a number of 
analytic traits in the testing program, the score on a single 
trait - Organization - was utilized in these analyses. 

Only students who responded to at least two-thirds of the selected 
items were used. Omits were treated as not correct. 

The number of scored items and their point values (maximum 
number of points) are summarized below. 



                            Constructed Response 
Content          Multiple   Two      Six      Total    Total 
Area      Grade  Choice     Point    Point    Items    Points 

Math      5      35         10       0        45       55 
Math      8      35         10       0        45       55 
Writing   3      29         3        2        34       47 
Writing   8      25         6        2        33       49 



Analyses 



Construction of Forms 

The subsets of items chosen for the operational tests 
represented an (unweighted) Baseline condition of 
test-blueprint-representative forms, assuming that the addition of a second 
prompt to the two Writing tests would be required by the 
blueprint if testing time permitted. 

Several different types of forms that weighted c.r. responses 
were created, each constructed to have the same number of total 
test points and approximate difficulty after weighting as the 
baseline forms from which the item responses were drawn. This 
was accomplished by partitioning c.r. items in a form into two 
matched sets of approximately the same difficulty (when the 
content and the number of the c.r. items permitted), deleting one 
of the sets, and weighting the remaining set. 

Two instances of tcc component weighting were implemented. 
The first weighted the members of one of the sets of c.r. items 
in a form by a factor of two and is referred to as CRx2. The 
even number of c.r. items in the two Math Baseline forms (10) and 
the Grade 8 Writing form resulted in the matched sets being of 
equal size as well as similar content, with most frequently a 
content category of a deleted c.r. item being represented by a 
c.r. item in the remaining weighted set. 

The second instance of tcc component weighting was based on 
the weighting of one of the two e.r. items in each of the two 
Writing forms by a factor of two and is referred to as ERx2. 

The last type of weighting of the c.r. items, Summed 
Ratings, was created for those tests having c.r. items with more 
than two points (three levels including 0); that is, the two 
Writing tests. Only c.r. items with three or more points were 
subjected to a second reading and hence only the two writing 
prompts could have an item score based on a summed rating. One 
of the two prompts in each Writing form was deleted and a summed 
rating item score was obtained for the remaining prompt. Because 
the testing program called for a third, reconciliation reading if 
the two readers differed by more than a point, the item score was 
either the sum of two readings or the sum of three readings 
multiplied by 2/3 and rounded to the nearest integer. 
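
The summed-rating rule described above can be sketched in Python. This is a minimal illustration, not the testing program's scoring code: the function names `summed_levels` and `summed_rating` are hypothetical, and the sketch assumes integer ratings and that a third reading is supplied only when the first two differ by more than one point.

```python
def summed_levels(n_levels):
    """Score levels of a summed-rating item built from a rubric with
    n_levels levels (0 included): sums run from 0 to 2 * (n_levels - 1)."""
    return 2 * n_levels - 1


def summed_rating(first, second, third=None):
    """Summed-rating item score from two (or three) integer readings.

    Assumption: a third, reconciliation reading is passed only when the
    first two readings differ by more than one point.
    """
    if third is None:
        if abs(first - second) > 1:
            raise ValueError("readings differ by more than a point; "
                             "a reconciliation reading is required")
        return first + second
    # Rescale the three-reading sum to the two-reading metric (x 2/3),
    # then round to the nearest integer. Twice an integer divided by 3
    # never ends in .5, so no tie-breaking rule is needed.
    return round(2 * (first + second + third) / 3)
```

For a rubric with seven levels, `summed_levels(7)` gives 13 levels; readings of 2 and 4 reconciled by a third reading of 3 give a summed score of 6.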

Table 1 contains the items and their p-values (average item 
score divided by the maximum number of points) in the matched 
sets of c.r. items used in the creation of the CRx2, ERx2, and 
Summed forms of weighted c.r. responses. 

Evaluations of Forms 

Properties of the total test scores derived from the three 
types of forms, employing either tcc component or Summed rating 
weighting, were compared against the criterion baseline forms. 

The relationships between total raw scores and ability were 
examined through comparisons of tccs. Conditional standard 
errors were evaluated through standard error (se) curves. Scale 
scores produced by weighting were compared to those from the 
baseline forms and the magnitude of differences determined. 

The dimensionality of the baseline forms was evaluated by 
utilizing Poly-Dimtest (Li & Stout, 1995) to detect violations of 
the assumption of unidimensionality. Specifically the presence 
of a significant dimension underlying the c.r. items was 
assessed. 

Rating Process 

Readers were trained to implement scoring rubrics; anchor 
papers, check sets, and read behinds were employed to verify and 
maintain scoring accuracy. Inter-rater reliability studies that 
incorporated second reads for a large sample of students taking 
each test indicated that the percentage of exact agreement on the 
c.r. items in the Math tests ranged between 92.58% and 100.00%. 
Exact agreement rates for the two-point Writing c.r. items ranged 
between 55.67% (66.46% for the second lowest exact rate) and 
87.77%. The exact agreement rates for the selected 
"Organization" trait on the Writing prompts ranged between 58.84% 
and 62.23% with the approximate agreement rates (within one 
point) between 97.97% and 98.99%. 

Scaling Process 

Multiple-choice and open-ended items were scaled together 
using the generalized IRT model. With the generalized model a 
three-parameter logistic model (Lord, 1980) was used for the 
multiple-choice items: 

    P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 A_i (\theta - B_i)]}    (1) 

where $A_i$ is the discrimination, $B_i$ is the difficulty, and $c_i$ is 
the lower asymptote or guessing parameter for item i. 
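
The three-parameter logistic model for the multiple-choice items can be sketched as a short function. This is an illustrative sketch only: the function name `p_3pl` is hypothetical, and the scaling constant 1.7 is an assumption of the sketch, following the usual form of the Lord (1980) model.

```python
import math

D = 1.7  # scaling constant; an assumption of this sketch


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination (A_i), b: difficulty (B_i),
    c: lower asymptote or guessing parameter (c_i).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

At an ability equal to the item difficulty the probability is halfway between the lower asymptote and 1, which gives a quick sanity check on any set of parameter estimates.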



A generalization of Masters' (1982) Partial Credit model was 
used for the c.r. items. This two-parameter partial credit (2PPC) 
model is the same as Muraki's (1992) "generalized partial credit 
model." For a c.r. item with $m_i$ score levels assigned integer 
scores that ranged from 0 to $m_i - 1$: 

    P_{ik}(\theta) = P(x_i = k - 1 \mid \theta) = \frac{\exp(z_{ik})}{\sum_{j=1}^{m_i} \exp(z_{ij})}, \quad k = 1, \ldots, m_i,    (2) 

where 

    z_{ik} = \alpha_i (k - 1)\theta - \sum_{j=0}^{k-1} \gamma_{ij}, 

and $\gamma_{i0} = 0$. $\alpha_i$ is the item discrimination; $\gamma_{ij}$ is related to the 
difficulty of the item levels: the trace lines for adjacent score 
levels intersect at $\gamma_{ij} / \alpha_i$. 

Parameter Estimation 

Item parameter estimation was conducted using the program PARDUX 
(Burket, 1991; 1995). Item parameters were estimated using 
marginal maximum likelihood procedures implemented with an EM 
algorithm. Evaluations of the accuracy of the program with 
simulated data (Fitzpatrick, 1990) have found it to be at least 
as accurate as MULTILOG (Thissen, 1986). The ability scale was 
defined by specifying a prior true $\theta$ distribution to have a mean 
of 0.0 and standard deviation of 1.0. Item parameter estimates 
were linearly transformed to a scale score metric by multiplying 
by 50 and adding 500. The LOSS and HOSS (lowest and highest 
obtainable scale scores) were set for each form to allow for a 
wide range of scale scores that could accommodate different 
weightings of the c.r. items. 
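
The 2PPC score-level probabilities and the linear scale transformation described above can be sketched as follows. The function names `p_2ppc` and `to_scale_score` are hypothetical, and the sketch applies the transformation to ability values only; LOSS and HOSS truncation is omitted because their values are not given here.

```python
import math


def p_2ppc(theta, alpha, gammas):
    """Score-level probabilities for a 2PPC item.

    gammas holds gamma_1 .. gamma_{m-1}; gamma_0 = 0 is implicit.
    Returns a list whose entry s is P(score = s | theta), s = 0 .. m-1.
    """
    z = [0.0]          # z for a score of 0
    cumulative = 0.0
    for score, gamma in enumerate(gammas, start=1):
        cumulative += gamma
        z.append(alpha * score * theta - cumulative)
    z_max = max(z)     # subtract the maximum for numerical stability
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


def to_scale_score(theta, slope=50.0, intercept=500.0):
    """Linear transformation from the theta metric to the scale score metric."""
    return slope * theta + intercept
```

Evaluating `p_2ppc` at an ability of gamma divided by alpha for a given level returns equal probabilities for that level and the one below it, which is the intersection property of adjacent trace lines stated in the text.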

Student Scores 

The relationship between the predicted raw score and the 
ability estimate $\hat{\theta}_a$ (tcc) was obtained using the final item 
parameter estimates: 

    E(X_a \mid \hat{\theta}_a) = w_m \left\{ \sum_{i=1}^{mc} w_i P_i(\hat{\theta}_a) + \sum_{j=1}^{cr} w_j \sum_{k=1}^{m_j} (k - 1) P_{jk}(\hat{\theta}_a) \right\},    (3) 

where the predicted total score has been partitioned into 
components for the mc multiple choice items and the cr 
constructed response items. For (unweighted) number-correct 
scoring, such as that employed for the baseline forms, the 
weights $w_i$ and $w_j$ are all equal to 1. 

Each selected c.r. item in the CRx2 forms and selected e.r. 
item in the ERx2 forms had $w_j$'s set to 2, with again all $w_i$ for 
the m.c. items set equal to 1. Scoring tables were constructed 
for all forms consisting of the scale scores corresponding to 
integer values of $E(X_a \mid \hat{\theta}_a)$. 

The weight $w_m$, which multiplies each item probability along 
with the weights $w_i$ or $w_j$, serves to determine the total number 
of points in the total score. Set to 1, the number of test score 
points is preserved at that for the baseline forms. If allowed 
to decrease between 1 and 0, the number of total score points can 
be preserved even when c.r. items are weighted by factors 
(weights) that exceed two. 
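
The weighted test characteristic curve described above can be sketched in Python. This is an illustration under stated assumptions, not the authors' scoring software: all function names and item parameters are hypothetical, the 3PL scaling constant 1.7 is assumed, m.c. weights are fixed at 1, and `overall_weight` reflects one plausible reading of how the overall multiplier would be chosen (baseline points divided by weighted points).

```python
import math

D = 1.7  # assumed scaling constant for the 3PL items


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_2ppc(theta, alpha, gammas):
    z, cumulative = [0.0], 0.0
    for score, gamma in enumerate(gammas, start=1):
        cumulative += gamma
        z.append(alpha * score * theta - cumulative)
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


def expected_raw_score(theta, mc_items, cr_items, cr_weights, w_m=1.0):
    """Weighted expected raw score at theta (m.c. weights fixed at 1).

    mc_items: list of (a, b, c); cr_items: list of (alpha, gammas);
    cr_weights: one weight per c.r. item; w_m: overall multiplier.
    """
    mc_part = sum(p_3pl(theta, a, b, c) for a, b, c in mc_items)
    cr_part = 0.0
    for (alpha, gammas), w_j in zip(cr_items, cr_weights):
        probs = p_2ppc(theta, alpha, gammas)
        # Expected item score: sum over score levels of score * probability.
        cr_part += w_j * sum(score * p for score, p in enumerate(probs))
    return w_m * (mc_part + cr_part)


def overall_weight(baseline_points, weighted_points):
    """Multiplier that rescales a weighted form to the baseline point total."""
    return baseline_points / weighted_points


# Hypothetical parameters, for illustration only.
mc_items = [(1.0, -0.5, 0.2), (0.8, 0.3, 0.25)]
cr_items = [(0.9, [-0.4, 0.6])]  # one two-point item
unweighted = expected_raw_score(0.0, mc_items, cr_items, [1])
doubled = expected_raw_score(0.0, mc_items, cr_items, [2])
```

Under this reading, a Grade 8 Writing form with 25 m.c. items, six two-point c.r. items, and one six-point e.r. item weighted by four would carry 25 + 12 + 24 = 61 points before rescaling, so `overall_weight(49, 61)` is about 0.80. That value is inferred from the item counts in the Method section and is not reported by the authors.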

Information 



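The information of the raw score at ability $\theta$ is 

    I(\theta) = \frac{\left[ \sum_{i=1}^{n} w_i \sum_{k=1}^{m_i} (k - 1) P'_{ik}(\theta) \right]^2}{\sum_{i=1}^{n} \sigma^2(w_i X_i \mid \theta)},    (4) 

where $P'_{ik}(\theta)$ is the derivative of $P_{ik}(\theta)$ with respect to $\theta$. 
The reciprocals of the square roots of these values, plotted for the $\theta$'s across the 
ability continuum, constitute the standard error curves for the 
$\theta$ and corresponding scale score metrics. 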

Total information for each item was obtained by accumulating 
values of equation 4 over the range of ability. 
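
The raw-score information and the resulting standard error can be sketched as follows. This is a sketch under stated assumptions: all names and item parameters are hypothetical, the 3PL scaling constant 1.7 is assumed, local independence is assumed so that item variances add, and the scale score standard error is taken as 50 divided by the square root of the information (the slope of the linear scale transformation). An overall multiplier applied to every item cancels between numerator and denominator and is therefore omitted.

```python
import math

D = 1.7  # assumed scaling constant for the 3PL items


def mc_slope_and_variance(theta, a, b, c):
    """Slope of the item characteristic curve and item score variance (3PL)."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    slope = D * a * (p - c) * (1.0 - p) / (1.0 - c)
    return slope, p * (1.0 - p)


def cr_slope_and_variance(theta, alpha, gammas):
    """Slope of the expected item score and item score variance (2PPC)."""
    z, cumulative = [0.0], 0.0
    for score, gamma in enumerate(gammas, start=1):
        cumulative += gamma
        z.append(alpha * score * theta - cumulative)
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    probs = [v / total for v in expz]
    mean = sum(s * p for s, p in enumerate(probs))
    variance = sum((s - mean) ** 2 * p for s, p in enumerate(probs))
    # For this model the derivative of the expected score is alpha * variance.
    return alpha * variance, variance


def raw_score_information(theta, mc_items, cr_items, cr_weights):
    """Squared slope of the weighted expected raw score divided by the
    variance of the weighted raw score."""
    slope_sum, variance_sum = 0.0, 0.0
    for a, b, c in mc_items:
        slope, variance = mc_slope_and_variance(theta, a, b, c)
        slope_sum += slope
        variance_sum += variance
    for (alpha, gammas), w in zip(cr_items, cr_weights):
        slope, variance = cr_slope_and_variance(theta, alpha, gammas)
        slope_sum += w * slope
        variance_sum += w * w * variance
    return slope_sum ** 2 / variance_sum


def scale_score_se(information, slope=50.0):
    """Standard error in scale score units: slope / sqrt(information)."""
    return slope / math.sqrt(information)


# Hypothetical parameters: two matched c.r. items versus one retained
# c.r. item weighted by two.
mc_items = [(1.0, -0.5, 0.2), (0.9, 0.0, 0.2), (1.1, 0.5, 0.2)]
cr_item = (0.9, [-0.4, 0.6])
baseline_info = raw_score_information(0.0, mc_items, [cr_item, cr_item], [1, 1])
weighted_info = raw_score_information(0.0, mc_items, [cr_item], [2])
```

With these illustrative parameters the weighted form has less information, and hence a larger standard error, than the form that retains both c.r. items, because the weighted item's variance enters the denominator multiplied by the square of its weight.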



RESULTS 

Raw Score Statistics 

Descriptive statistics for the Baseline, CRx2, ERx2, and 
Summed forms of the four tests are presented in Table 2. (Forms here 
are differently scored versions of what may be the same 
set of test items.) The four Baseline forms differed in 
difficulty, with average p-values ranging between .375 for the 
difficult Math Grade 8 form and .686 for Writing Grade 3. 

Analyzing forms within meaningful comparison sets: 

1) Math: CRx2 versus Baseline for Grades 5 and 8 {Math (Two-Point) 
   CR Analysis}, 
2) Writing: CRx2 versus Baseline for Grade 8 {Writing CR (Two-Point) 
   Analysis}, and 
3) Writing: ERx2 and Summed versus Baseline for Grades 3 and 8 {Writing 
   ER Analysis} 

reveals that the forms are very similar, an expected result given 
the relatively few items per form that were weighted and the 
similarity in the difficulties of deleted and retained c.r. items. 

The largest difference in form means within the three 
comparison sets was .33 for the Baseline and Summed forms for 
Writing Grade 8 (means of 28.63 and 28.30, respectively). The 
largest difference from a Baseline standard deviation (sd) was .19 
for the ERx2 form for Writing Grade 8 (8.06 versus 7.87 
{Baseline}, respectively). 

The reliability (stratified alpha) of the Baseline form is 
consistently slightly above that of the CRx2 forms, with the 
largest decrease occurring for Math Grade 5 (.871 versus .831). 
Test reliability is virtually the same across the Baseline, ERx2, 
and Summed Writing Grade 3 forms but is less for the Baseline 
Grade 8 Writing form (.868) than it is for the ERx2 (.894) and Summed 
(.892) versions. The relatively attenuated values of the 
stratified alphas for both Writing Baseline forms reflect the 
inability to include the retained (and weighted) prompt in the 
computation of the statistic for the ERx2 and Summed forms. A 
stratum size of only one item results in the e.r. item being 
excluded from the computation and subsequently higher stratified 
alphas for the weighted forms (i.e. forms with weighted c.r. 
responses). 

Dimensionality 

To evaluate whether the c.r. items in the Baseline forms 
were dimensionally distinct from the m.c. items, Poly-Dimtest (Li 
& Stout, 1995) analyses were conducted using an AT1 subtest 
consisting of only c.r. items. The results of these analyses are 
shown in Table 3. All but one Baseline form, Math Grade 5, were 
found to be unidimensional. The Grade 5 Math Baseline form was 
marginally significant at p=.038. 

Although the p-values for the c.r. items were generally 
lower than those for the m.c. items in each Math form, the AT1 subtests for 
both Math forms passed the Wilcoxon rank sum test as implemented 
in Poly-Dimtest using the default significance level of .02. 

TCCs 

Plots of the tccs are presented, along with a tabling of the 
pairs of scale scores (SS) and predicted raw score (RS) values, 
for Math Grade 5 in Figure 1. Results for Math Grade 8 were 
similar and are not provided. Predicted scores for the Baseline 
and CRx2 forms are very similar across the ability scale, 
differing by at most 1.39 raw score points (46.80 for Baseline 
versus 45.41) at a scale score of 625. The tccs for the Writing 
Grade 8 CR Analysis in Figure 2 demonstrate even smaller 
differences between predicted scores with a maximum difference of 
.65 (43.54 for Baseline versus 42.89 for CRx2) at a scale score of 
675. 

The results for the ER Analysis for Writing Grade 8 presented 
in Figure 3 were similar to those seen for the Baseline, Summed, and 
ERx2 forms for Writing Grade 3 (not presented). Predicted raw 
scores between the LOSS and HOSS for the Summed form differ by no 
more than 1.64 from the Baseline form (24.01 versus 25.65, 
respectively, at 475) with even smaller differences between the 
ERx2 and Baseline forms (max. difference of 13.34 - 13.08 = .26 at 
400). 

Standard Error 

Total item information presented in Table 4 was preliminarily 
evaluated for the items in the four Baseline forms. The location 
of the items, that is the scale score value at which the item 
contributes the maximum information, is also provided. The mean 
information by item type at the bottom of the table indicates that 
the Math c.r. items contribute more than twice the amount of 
information, on average, contributed by the m.c. items (e.g. .045 versus 
.021 for Grade 8). 

The substantial information contribution of the Math c.r. 
items, exceeding the ratio of point values of the two item types 
(better than two-to-one), is not seen with the Writing c.r. items. 
The contribution of information by the Writing c.r. items is less 
than two-to-one for the two-point items and between approximately 
three-to-one and four-to-one (.068 versus .017 for Grade 8) for 
the six-point e.r. items. The information value for one of the 
e.r. items in the Grade 3 test (item # 33) is attenuated because 
the absence of students obtaining a perfect score of 6 
necessitated a collapse of a category. 

The Baseline se curves in the CR and ER Analyses depicted in 
Figures 4 through 7 are the plotted values of the reciprocal of 
the square root of test information (equation 4). In Figure 4 for Math Grade 8 (Math 
Grade 5 was similar and is not provided), the CRx2 form 
demonstrates an 18% increase in standard error over baseline {(13 
- 11)/11} in the 550 to 565 scale score range where precision is 
the greatest (hereafter point of form targeting). Scores for the 
CRx2 form are slightly more precise (smaller standard error) at the 
lower end of the scale but more than 30% less precise than the 
Baseline scores between 700 and 800 scale score points (e.g. 
{81 - 62(i)}/62 = 30.6% at 726, where the "i" indicates an interpolated 
value). 

The CR Analysis of se curves for Writing Grade 8 in Figure 5 
indicates error for the CRx2 scores is larger than that for the 
Baseline form across the scale, with the difference 
increasing after approximately 550. CRx2 scores have 21% greater 
error where the forms are targeted (23 versus 19 in the vicinity 
of 475). In the upper portion of the scale, the standard error 
for the CRx2 scores has increased to more than 30% above that for 
Baseline (81 vs 62(i) at 726). 

Figures 6 and 7 portray the ER Analyses for the two Writing 
forms. With the exception of intervals near the LOSS or HOSS of 
the forms, Summed scale scores have a degree of error between that 
of scores for the Baseline and ERx2 forms. At the point of 
targeting, Summed and ERx2 scores have standard errors at most two 
scale score points (less than 11%) from that of the Baseline 
scores (21 for ERx2 versus 19 for Baseline at 471 for Writing 
Grade 8 in Figure 7). 

Error for the ERx2 and Summed scores increases in the upper 
third of both scales. Relative to the Grade 3 Baseline se of 68 
at 679 in Figure 6, the increased error is 44% (98{i}) and 19% 
(81{i}), respectively. At Grade 8 the increases, relative to a 
Baseline error of 61 at a scale score of 768, are 33% (81{i}) and 
25% (76{i}), respectively. 
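
The percentage increases quoted in this section follow from a simple ratio of standard errors. The short sketch below (the function name `pct_increase` is hypothetical) reproduces several of them from the standard error values given in the text.

```python
def pct_increase(weighted_se, baseline_se):
    """Percentage increase in standard error relative to Baseline."""
    return 100.0 * (weighted_se - baseline_se) / baseline_se


# Standard errors quoted in the text (scale score units).
math8_targeted = pct_increase(13, 11)
math8_upper = pct_increase(81, 62)
writing3_erx2 = pct_increase(98, 68)
writing3_summed = pct_increase(81, 68)
writing8_erx2 = pct_increase(81, 61)
writing8_summed = pct_increase(76, 61)
```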

Increased C.R. Item Weighting 

By utilizing a value between 0 and 1 for $w_m$ in equation 3, 
the relative weight applied to the c.r. items can be increased 
beyond a factor of two while preserving the same number of test 
points as the Baseline forms. The effect of increasing the 
relative weight of the retained e.r. item in the Writing Grade 8 
test to a value of four times the weight of a m.c. item (ERx4) is 
depicted in Figure 8. 

Standard error for ERx4 scores is increased relative to the 
Baseline and other weighted forms. As is the case with the other 
weighted forms, the increment is relatively small in the lower 
portion of the scale (52{i} vs 44 for an 18% increase at 349) but 
increases throughout the scale. Between 450 and 500, where the 
forms are targeted, the ERx4 scores have 37% more error (26 vs 19), 
which increases to 47% at a scale score of 768 (90{i} vs 61). 

Scale Score Comparisons 



Scale scores were obtained for the Baseline and weighted 
forms through unweighted and weighted raw score-to-scale score 
tables. Figure 9 contains plots (against Baseline) of the CR 
Analyses for the two Math tests and Writing Grade 8. 

Scale scores obtained through weighting the retained c.r. 
items demonstrate a strong linear relationship to Baseline scores, 
with a product moment correlation (r) that exceeds .980 for both 
of the Math tests and a slightly lower .963 for Writing Grade 8. 

Figure 10 depicts the relationship between the forms of the 
ER Analysis of the Writing Grade 3 forms, as well as scale scores 
obtained when weighting the retained e.r. item by a factor of four 
relative to a m.c. item (ERx4). Similar results, obtained for 
Writing Grade 8, are not presented. 

Scores between the Baseline and the two weighted forms, ERx2 
and Summed, exhibit the high degree of correlation (.974 and .981, 
respectively) expected for forms that share all but one of their 
items, with no signs of non-linearity. ERx4 scores have a 
slightly reduced correlation with Baseline scale scores (.942). 

All the plots demonstrate greater scatter at the ends of the 
scale where error is greater. This is especially prominent at the 
upper portion of the Writing scales presented at the bottom of 
Figure 9 for Grade 8 and in Figure 10 for Grade 3. 

Distributions of scale scores and their differences are 
described in Table 5, including those obtained after weighting 
the c.r. and e.r. items four times that of a m.c. item (CRx4 and 
ERx4). The means and standard deviations of the CRx2 and ERx2 
scale score distributions resemble the corresponding raw score 
distributions in Table 2 in their similarity to the Baseline 
distributions. 

Increasing the weight of the c.r. items by a factor as large 
as four (while maintaining the number of test points) serves to 
further increase the standard deviation of the scores relative to 
Baseline but generally not the means. This may be seen in the 
standard deviations for Writing Grade 8, which, starting from a 
Baseline value of 58.15, increase with CRx2 (63.75) and CRx4 
(70.09) as well as ERx2 (60.69) and ERx4 (65.25). 

The similarity in the means of the weighted form 
distributions to Baseline reflects the comparability of the 
Baseline and reduced length forms containing the weighted c.r. 
items. Consequently the largest differences are between the CRx2 
and CRx4 versus Baseline scale scores for the Grade 8 Writing 
forms (e.g. 502.60 for CRx2 versus 499.69), which reflect the 
relatively larger difference in difficulty between the retained 
and deleted sets of c.r. items for this test (.524 vs .501, 
respectively, in Table 1). 

Descriptive statistics for the differences between weighted 
form and Baseline scores are found in the right part of Table 5. 
Mean differences involving the Summed, CRx2 and ERx2 scores are 
small. The largest of these, 2.09 for CRx2-Baseline for Writing 
Grade 8, is inflated to a degree because of the difference in 
form difficulty mentioned above. Ten percent of the 3,288 
students in this sample obtained a CRx2 score that was at least 
16 scale score points less than their Baseline scores (10th percentile) 
while 10% received a CRx2 scale score that was at least 21 points 
above their Baseline score. The next largest mean difference for 
Summed, CRx2 or ERx2 scores was a substantially smaller 1.03 for 
the Summed scores for Writing Grade 8. The 10th and 90th 
percentiles for this distribution of differences were -8 and 10, 
respectively. 

An increase in the differences between weighted and Baseline 
scores as the weight given to the c.r. items increases can be seen 
when the CRx4 and ERx4 distributions of differences (relative to 
Baseline) are compared to the corresponding CRx2 or ERx2 
distributions. For example, the CRx4-Baseline 
distribution of differences for Writing Grade 8 has a larger 
mean and sd, and more extreme 10th and 90th percentiles (5.29, 27.35, 
-27, and 37, respectively) than the CRx2-Baseline differences 
(2.09, 17.45, -16, and 21, respectively). 
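
Summaries of this kind (mean, sd, and 10th and 90th percentiles of weighted-minus-Baseline differences) can be computed with a few lines of Python. This is only a sketch: the function name `describe_differences` and the data are hypothetical, and the percentile method ("inclusive" interpolation) and the use of the sample standard deviation are assumptions, since the paper does not state which conventions were used.

```python
import statistics


def describe_differences(weighted_scores, baseline_scores):
    """Mean, sd, and 10th/90th percentiles of weighted-minus-Baseline scores."""
    diffs = [w - b for w, b in zip(weighted_scores, baseline_scores)]
    # quantiles(n=10) returns the nine decile cut points.
    deciles = statistics.quantiles(diffs, n=10, method="inclusive")
    return {
        "mean": statistics.mean(diffs),
        "sd": statistics.stdev(diffs),
        "p10": deciles[0],
        "p90": deciles[-1],
    }


# Hypothetical scale scores, for illustration only.
baseline = [500] * 21
weighted = [500 + d for d in range(-10, 11)]
summary = describe_differences(weighted, baseline)
```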

Discussion and Conclusions 

The selection of subsets of items from 
test-blueprint-representative forms of similar content and difficulty permitted 
an evaluation, unconfounded by these factors, of the effects of 
weighting c.r. items on total test scores relative to criterion 
scores of putatively greater generalizability. As expected, there 
was a cost in terms of precision of having fewer, though weighted 
(tcc component or Summed), c.r. items across a very wide range of 
ability. 

The increment in standard error attributed to weighting was 
predictably less in the middle of the scale where the forms were 
targeted. For the particular tests and number of items deleted 
(and weighted) in this study, there was an increase of approximately 5% 
to 20% in standard error at this point. Error in scores 
containing weighted c.r. items increased more substantially in the 
upper end of the scale, where there was a 20% to 45% reduction in 
precision. The magnitude of increase in error and the particular 
portion of the scale where it occurs are determined by the 
locations and amount of information contributed by the deleted 
c.r. items relative to those that are retained. 

The greater difficulty of the c.r. items meant that the 
location of the deleted items would tend to fall in the upper half 
of the scale score range, implying the total information 
contributed by the remaining items would be less in this part of 
the scale (greater error). The weighting of the retained c.r. 
items, though tending to be of the same difficulty as the deleted 
c.r. items, does not produce as much information as that 
contributed by the deleted items. Each variance of a weighted 
item in the denominator of equation 4 is multiplied by the square 
of the applied weight. The sum of the item variances subsequently 
increases faster than the square of the sum of derivatives $\{P'_{ik}(\theta)\}$ 
for the weighted (and unweighted) items in the numerator, 
resulting in less information and hence greater error. 
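
The effect of the squared weights can be illustrated numerically. The sketch below is not equation 4 itself: it assumes dichotomous items following a two-parameter logistic model and uses hypothetical item parameters, whereas the c.r. items in the study were polytomous. It evaluates the information of a weighted summed score as the squared sum of weighted derivatives divided by the sum of item variances multiplied by squared weights, and compares a baseline form with a form in which half of the harder items are deleted and the identical retained items are given a weight of two.

```python
import math


def prob_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model (an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def weighted_score_information(theta, items, weights):
    """Information of the weighted summed score at theta.

    Numerator: square of the sum of weighted derivatives w * P'(theta).
    Denominator: sum of item variances, each multiplied by w squared.
    """
    numerator = 0.0
    denominator = 0.0
    for (a, b), w in zip(items, weights):
        p = prob_2pl(theta, a, b)
        variance = p * (1.0 - p)
        derivative = a * variance  # dP/dtheta for the 2PL model
        numerator += w * derivative
        denominator += (w ** 2) * variance
    return numerator ** 2 / denominator


# Hypothetical parameters (a, b): ten easier items plus two identical pairs of
# harder "c.r." items, so that deleted and retained items match exactly.
easy_items = [(1.0, -1.0 + 0.2 * i) for i in range(10)]
hard_pair = [(1.2, 1.0), (1.2, 1.5)]

baseline_items = easy_items + hard_pair + hard_pair
baseline_weights = [1] * len(baseline_items)

weighted_items = easy_items + hard_pair  # one copy of each hard item deleted
weighted_weights = [1] * len(easy_items) + [2, 2]


def se_ratio(theta):
    """Ratio of weighted-form SE to baseline SE, with SE = 1 / sqrt(information)."""
    base = weighted_score_information(theta, baseline_items, baseline_weights)
    wtd = weighted_score_information(theta, weighted_items, weighted_weights)
    return math.sqrt(base / wtd)


if __name__ == "__main__":
    for theta in (-1.0, 0.0, 1.0, 2.0):
        print(f"theta={theta:+.1f}  SE ratio (weighted / baseline) = {se_ratio(theta):.3f}")
```

Because the deleted and retained items are identical in this sketch, the numerators of the two forms are equal and only the denominator grows, so the ratio exceeds one everywhere and is largest where the harder items are located.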

Summed ratings, which increase the relative contribution of 
c.r. items to the total test by adding scoring levels beyond those 
specified by the rubrics rather than multiplying a response by a 
factor, result in total scores with standard errors lower than 
those of the tcc component weighted scores throughout most of the 
score range. Summed ratings result in greater error than Baseline 
because the amount of information accrued from the additional 
levels is not twice the amount contributed by an e.r. item in the 
tests employed in the study. It is conceivable, if not likely, 
that there may be some c.r. items in other tests from which 
information gains of this magnitude could be attained. 

Weighting from one to five student constructed responses 
by summing or multiplying by a factor of two (Crx2 and Erx2 
analyses) resulted in scale scores that in most cases (80%) 
differed by no more than 13 scale score points 
from those obtained when additional items were administered. A 
small difference in the difficulties of deleted and retained c.r. 
items contributed to slightly larger differences for the Writing 
Grade 8 test. Quadrupling the c.r. weighting substantially 
increased the mean differences and came close to doubling the 10th 
and 90th percentile scale score differences. 
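
The summary statistics used to compare weighted and Baseline scale scores (mean, SD, and 10th and 90th percentiles of the differences, and the share of students within a given number of scale score points) can be computed as in the sketch below. The scale scores are hypothetical values invented for illustration, and the percentile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption; the study's own convention is not stated.

```python
import statistics


def summarize_differences(weighted, baseline):
    """Mean, SD, and 10th/90th percentiles of weighted-minus-baseline scores."""
    diffs = [w - b for w, b in zip(weighted, baseline)]
    deciles = statistics.quantiles(diffs, n=10, method="inclusive")
    return {
        "mean": statistics.mean(diffs),
        "sd": statistics.stdev(diffs),  # sample SD (an assumption)
        "p10": deciles[0],
        "p90": deciles[-1],
    }


def share_within(weighted, baseline, tolerance):
    """Proportion of students whose two scale scores differ by at most tolerance."""
    diffs = [abs(w - b) for w, b in zip(weighted, baseline)]
    return sum(d <= tolerance for d in diffs) / len(diffs)


# Hypothetical scale scores for eleven students (not data from the study).
baseline_scores = [480, 500, 520, 540, 560, 580, 600, 620, 640, 660, 680]
offsets = [-15, -10, -5, -3, 0, 2, 4, 6, 10, 14, 25]
weighted_scores = [b + d for b, d in zip(baseline_scores, offsets)]

summary = summarize_differences(weighted_scores, baseline_scores)
within_13 = share_within(weighted_scores, baseline_scores, 13)

if __name__ == "__main__":
    print(summary)
    print(f"share within 13 points: {within_13:.2f}")
```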

The greater unreliability in the scoring of the Writing as 
opposed to the Math c.r. items likely contributed to the greater 
differences for this content area. The potential to increase 
score precision by improved rubrics and scoring, along with the 
magnitude of error at important portions of the scale, such as 
cutscores, should be addressed prior to weighting c.r. items. 

There are several other validity-related considerations that 
need to inform a decision to weight. The dimensionality 
assessments of the Baseline forms indicated one test - Math Grade 
5 - was not unidimensional, having a significant second dimension 
defined by the c.r. items. If the multidimensionality is due to 
an enduring domain attribute or proficiency rather than a 
characteristic unique to the particular sampled c.r. items there 
is a potential impact on important psychometric functions such as 
form equating. Tcc component weighting may pose less of a 
problem than Summed Ratings under these circumstances because of 
its implementation "outside" of the IRT scale. 

The effects of weighting on score precision and the threat 
that multidimensionality impairs the accuracy of the standard 
errors must be evaluated in light of the purpose of testing. 
Higher stakes testing, with the greater consequences for the 
student that attend score interpretation, requires at the very 
least a documentation of the sources and magnitude of 
disturbances to model-based reliability estimates as a 
prerequisite to a valuation. It would also seem to require a 
demonstration of how greater validity is obtained by increasing 
the representation of c.r. items through weighting rather than 
the number of items. Pursuant to that goal would be the 
presentation of evidence that the content or processes 
assessed are sufficiently important to justify weighting rather 
than an increase in testing time. 

Table 1 

Retained and Deleted C.R. Item Sets 

CRx2 

        Math 5                      Math 8                      Writing 8
        Retained      Deleted       Retained      Deleted       Retained      Deleted
        Item P-value  Item P-value  Item P-value  Item P-value  Item P-value  Item P-value
           6   0.283     9   0.032     4   0.122    10   0.047     5   0.563    13   0.471
          26   0.021    15   0.059    18   0.148    15   0.186    10   0.245    24   0.364
          28   0.414    20   0.335    27   0.073    21   0.227    18   0.766    27   0.649
          38   0.106    33   0.185    41   0.072    32   0.145    32   0.521    33   0.518
          42   0.095    35   0.334    42   0.300    36   0.094
Mean           0.184         0.189         0.143         0.140         0.524         0.501
SD             0.161         0.145         0.094         0.072         0.214         0.118

ERx2 and Summed 

        Writing 3                   Writing 8
        Retained      Deleted       Retained      Deleted
        Item P-value  Item P-value  Item P-value  Item P-value
          34   0.490    33   0.471    32   0.521    33   0.518

Table 2 

Raw Score Descriptive Statistics 

Table 3 

Poly-Dimtest Significance Tests for the 
Hypothesis of Unidimensionality 

                    Baseline
Content   Grade   No. Items        T   p-value
Math        5        45        1.779    0.038*
            8        45       -1.070    0.858
Writing     3        34        0.625    0.266
            8        33       -0.849    0.802

* p < .05 
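
As a check on how the tabled p-values relate to the T statistics, the sketch below assumes that T is referred to the upper tail of a standard normal distribution. That reference distribution is an assumption about the test rather than something stated in the table, although it reproduces the tabled p-values to rounding. The dictionary and function names are illustrative.

```python
import math


def upper_tail_p(t_statistic):
    """One-sided p-value P(Z >= t) under a standard normal reference distribution."""
    return 0.5 * math.erfc(t_statistic / math.sqrt(2.0))


# T statistics from Table 3, keyed by (content, grade).
TABLE3_T = {
    ("Math", 5): 1.779,
    ("Math", 8): -1.070,
    ("Writing", 3): 0.625,
    ("Writing", 8): -0.849,
}

p_values = {form: upper_tail_p(t) for form, t in TABLE3_T.items()}

if __name__ == "__main__":
    for (content, grade), p in p_values.items():
        flag = "*" if p < 0.05 else ""
        print(f"{content} Grade {grade}: p = {p:.3f}{flag}")
```

Under this assumption only the Math Grade 5 form falls below the .05 level, which matches the form identified in the discussion as not unidimensional.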

Table 4 

Item Total Information for the Baseline Forms 

       Math                                      Writing
       Grade 5              Grade 8              Grade 3              Grade 8
Item   Item       Total     Item       Total     Item       Total     Item       Total
No.    Location#  Info.*    Location#  Info.*    Location#  Info.*    Location#  Info.*
   1         550  0.016           533  0.016           460  0.019           490  0.014
   2         563  0.012           568  0.015           461  0.015           489  0.008
   3         487  0.019           548  0.020           453  0.025           438  0.014
   4         388  0.006           586  0.033¹          522  0.021           623  0.008
   5         516  0.015           567  0.024           474  0.019           476  0.026¹
   6         573  0.037¹          586  0.020           494  0.027           462  0.016
   7         470  0.017           508  0.016           471  0.027           454  0.015
   8         458  0.024           610  0.010           466  0.022           498  0.015
   9         617  0.059¹          554  0.027           464  0.034           445  0.017
  10         558  0.016           647  0.028¹          547  0.013           640  0.018¹
  11         552  0.010           607  0.018           517  0.028           454  0.024
  12         576  0.027           551  0.032           506  0.033           523  0.015
  13         592  0.018           497  0.016           488  0.021           507  0.028¹
  14         492  0.026           560  0.009           506  0.024           398  0.016
  15         607  0.045¹          560  0.045¹          510  0.020           576  0.010
  16         438  0.014           541  0.009           498  0.027           484  0.020
  17         571  0.014           575  0.007           498  0.028           524  0.021
  18         584  0.014           585  0.031¹          486  0.020           439  0.026¹
  19         426  0.015           593  0.003           534  0.014           514  0.016
  20         529  0.033¹          547  0.042           574  0.021           497  0.022
  21         558  0.014           557  0.034¹          487  0.029¹          529  0.027
  22         575  0.014           502  0.018           470  0.017           554  0.021
  23         545  0.010           531  0.019           458  0.021           471  0.019
  24         602  0.020           583  0.023           453  0.023           540  0.025¹
  25         557  0.008           614  0.017           459  0.029¹          476  0.014
  26         634  0.044¹          586  0.014           559  0.026           594  0.012
  27         557  0.016           585  0.084¹          519  0.023           455  0.023¹
  28         514  0.034¹          549  0.017           442  0.031           443  0.018
  29         478  0.018           547  0.027           477  0.018           479  0.022
  30         561  0.017           534  0.039           499  0.024           453  0.031
  31         520  0.018           547  0.052           429  0.029¹          508  0.017
  32         523  0.028           567  0.062¹          417  0.021           411  0.076²
  33         572  0.028¹          489  0.013           404  0.060³          400  0.061²
  34         505  0.017           509  0.014           405  0.063²
  35         539  0.032¹          571  0.032
  36         584  0.037           593  0.035¹
  37         557  0.028           484  0.016
  38         601  0.044¹          566  0.033
  39         539  0.025           550  0.030
  40         537  0.012           560  0.039
  41         570  0.027           584  0.069¹
  42         601  0.032¹          546  0.029¹
  43         525  0.015           551  0.030
  44         590  0.014           562  0.017
  45         425  0.013           531  0.011

Mean m.c.         0.017                0.021                0.023                0.017
SD                0.007                0.011                0.005                0.006
Mean 2-point c.r. 0.039                0.045                0.029                0.024
SD                0.009                0.020                0.000                0.004
Mean 6-point c.r.   -                    -                  0.062                0.068
SD                  -                    -                  0.002                0.010

*Area under the information function 
# Point of maximum information 

¹ Two-point CR items 

² Six Point Writing Prompt 

³ Writing prompt with a maximum score of 5 after collapsing one level 
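
The summary rows of Table 4 can be checked against the item-level entries. The sketch below recomputes the mean and SD of total information for the ten Math Grade 5 c.r. items listed in Table 1 and compares the mean with the tabled mean for the multiple-choice items. Use of the sample SD is an assumption (the table does not say which SD was computed), and the variable names are illustrative.

```python
import statistics

# Item total information for the ten Math Grade 5 c.r. items named in Table 1
# (items 6, 9, 15, 20, 26, 28, 33, 35, 38, 42), as tabled in Table 4.
MATH5_CR_INFO = [0.037, 0.059, 0.045, 0.033, 0.044, 0.034, 0.028, 0.032, 0.044, 0.032]

# Mean total information for the Math Grade 5 multiple-choice items, as tabled.
MATH5_MC_MEAN = 0.017

cr_mean = statistics.mean(MATH5_CR_INFO)
cr_sd = statistics.stdev(MATH5_CR_INFO)  # sample SD (an assumption)
cr_to_mc_ratio = cr_mean / MATH5_MC_MEAN

if __name__ == "__main__":
    print(f"mean c.r. information: {cr_mean:.3f}")
    print(f"SD of c.r. information: {cr_sd:.3f}")
    print(f"ratio of mean c.r. to mean m.c. information: {cr_to_mc_ratio:.2f}")
```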

Figure 1 

Test Characteristic Curves for the Math Grade 5 Forms 

[Plot: x-axis Scale Score, 300 to 800; curves for Baseline and CRx2] 

          Baseline             CRx2
  SS       RS       SE       RS       SE
 300     7.88   279.79     7.77   270.02
 325     8.15   184.73     8.06   178.35
 350     8.57   121.67     8.49   117.88
 375     9.22    81.06     9.17    79.21
 400    10.20    55.21    10.21    54.84
 425    11.68    38.66    11.76    39.41
 450    13.86    28.13    14.02    29.74
 475    16.91    21.74    17.14    24.03
 500    20.92    18.05    21.11    20.84
 525    25.76    15.91    25.76    18.96
 550    31.18    14.64    30.87    17.87
 575    36.81    14.23    36.13    18.20
 600    42.17    14.90    41.07    20.16
 625    46.80    16.72    45.41    22.90
 650    50.21    20.78    48.94    26.95
 675    52.30    27.92    51.40    34.12
 700    53.46    38.38    52.89    45.09
 725    54.10    52.16    53.74    59.79
 750    54.46    69.59    54.24    78.37
 775    54.67    91.38    54.53   101.51
 800    54.80   118.62    54.71   130.35

Figure 2 

Test Characteristic Curves for the Writing Grade 8 Forms: 
CRx2 and Baseline 

          Baseline             CRx2
  SS       RS       SE       RS       SE
 300     5.98    98.80     6.05   107.17
 325     6.73    63.63     6.84    73.32
 350     8.00    42.77     8.15    52.15
 375    10.06    30.91    10.22    39.33
 400    13.08    24.43    13.23    31.67
 425    16.94    20.89    17.06    26.92
 450    21.28    19.08    21.40    24.16
 475    25.65    18.71    25.75    23.32
 500    29.71    19.28    29.76    23.81
 525    33.27    20.67    33.22    25.59
 550    36.23    23.29    36.04    29.49
 575    38.54    27.43    38.20    36.10
 600    40.29    32.65    39.81    44.57
 625    41.62    38.24    41.04    53.33
 650    42.67    43.65    42.04    61.40
 675    43.54    48.48    42.89    68.31
 700    44.29    52.44    43.65    73.89
 725    44.96    55.61    44.34    78.31
 750    45.58    58.38    45.00    82.13
 775    46.15    61.33    45.61    86.07
 800    46.67    64.99    46.17    90.79
 825    47.13    69.81    46.67    96.84
 850    47.52    76.18    47.11   104.61
 875    47.85    84.40    47.49   114.38
 900    48.12    94.74    47.81   126.38

Figure 3 

Test Characteristic Curves for the Writing Grade 8 Forms: 
ERx2, Summed and Baseline 

          Baseline             ERx2             Summed
  SS       RS       SE       RS       SE       RS       SE
 425       --    20.89    17.15    24.77    15.53    23.33
 775       --       --       --       --       --    77.96
 800       --    64.99       --       --    45.41    86.55
 850       --       --    47.49   104.13    46.01   108.82
 875       --    84.40    47.81   114.84    46.22   122.70
 900    48.12    94.74    48.07   127.84    46.38   138.53

Figure 4 

Standard Error Curves for the CR Analyses of Math Grade 8 

Figure 5 

Standard Error Curves for the CR Analyses of Writing Grade 8 

Figure 6 

Standard Error Curves for the ER Analyses of Writing Grade 3 

1 A maximum of 45, rather than 47 points is possible because of the collapse of the uppermost 
category for each Writing prompt (0 and 1 student obtained a perfect score). 

2 A maximum of 44, rather than 47 points is possible because of the absence of students in the 
three highest categories for the Summed Writing rating prompt. 

3 A maximum of 45, rather than 47 points is possible because of the collapse of the uppermost 
category in the doubled Writing prompt. 

Figure 7 

Standard Error Curves for the ER Analyses of Writing Grade 8 

1 A maximum of 47, rather than 49 points is possible because of the absence of 
students in the two highest categories of the Summed Rating Writing prompt. 

Figure 8 

Writing Grade 8: Multiple Weighting Types 

1 A maximum of 47, rather than 49 points is possible because of the absence of 
students in the two highest categories of the Summed Rating Writing prompt. 

Figure 9 

CRx2 Weighted Scale Scores versus Baseline 

Figure 10 

Writing Grade 3 Weighted versus Baseline Scale Scores 